1 Overview

This is a fast-paced course that covers a lot of material. There will be a large amount of references. You may need to do your own research to fill in the gaps in between lectures and homework/projects. It is impossible to learn data science without getting your hands dirty. Please budget your time evenly. Last-minute work ethic will not work for this course.

Homework in this course is different from your usual homework assignment as a typical student. Most of the time, they are built over real case studies. While you will be applying methods covered in lectures, you will also find that extra teaching materials appear here. The focus will be always on the goals of the study, the usefulness of the data gathered, and the limitations in any conclusions you may draw. Always try to challenge your data analysis in a critical way. Frequently, there are no unique solutions.

Case studies in each homework can be listed as your data science projects (e.g. on your CV) where you see fit.

1.1 Objectives

  • Get familiar with R-studio and RMarkdown
  • Hands-on R
  • Learn data science essentials
    • gather data
    • clean data
    • summarize data
    • display data
    • conclusion
  • Packages
    • dplyr
    • ggplot

1.2 Instructions

  • Homework assignments can be done in a group consisting of up to three members. Please find your group members as soon as possible and register your group on our Canvas site.

  • All work submitted should be completed in the R Markdown format. You can find a cheat sheet for R Markdown here For those who have never used it before, we urge you to start this homework as soon as possible.

  • Submit the following files, one submission for each group: (1) Rmd file, (2) a compiled HTML or pdf version, and (3) all necessary data files if different from our source data. You may directly edit this .rmd file to add your answers. If you intend to work on the problems separately within your group, compile your answers into one Rmd file before submitting. We encourage that you at least attempt each problem by yourself before working with your teammates. Additionally, ensure that you can ‘knit’ or compile your Rmd file. It is also likely that you need to configure Rstudio to properly convert files to PDF. These instructions might be helpful.

  • In general, be as concise as possible while giving a fully complete answer to each question. All necessary datasets are available in this homework folder on Canvas. Make sure to document your code with comments (written on separate lines in a code chunk using a hashtag # before the comment) so the teaching fellows can follow along. R Markdown is particularly useful because it follows a ‘stream of consciousness’ approach: as you write code in a code chunk, make sure to explain what you are doing outside of the chunk.

  • A few good or solicited submissions will be used as sample solutions. When those are released, make sure to compare your answers and understand the solutions.

1.3 Review materials

  • Study Basic R Tutorial
  • Study Advanced R Tutorial (to include dplyr and ggplot)
  • Study lecture 1: Data Acquisition and EDA

2 Case study 1: Audience Size

How successful is the Wharton Talk Show Business Radio Powered by the Wharton School

Background: Have you ever listened to SiriusXM? Do you know there is a Talk Show run by Wharton professors in Sirius Radio? Wharton launched a talk show called Business Radio Powered by the Wharton School through the Sirius Radio station in January of 2014. Within a short period of time the general reaction seemed to be overwhelmingly positive. To find out the audience size for the show, we designed a survey and collected a data set via MTURK in May of 2014. Our goal was to estimate the audience size. There were 51.6 million Sirius Radio listeners then. One approach is to estimate the proportion of the Wharton listeners to that of the Sirius listeners, \(p\), so that we will come up with an audience size estimate of approximately 51.6 million times \(p\).

To do so, we launched a survey via Amazon Mechanical Turk (MTurk) on May 24, 2014 at an offered price of $0.10 for each answered survey. We set it to be run for 6 days with a target maximum sample size of 2000 as our goal. Most of the observations came in within the first two days. The main questions of interest are “Have you ever listened to Sirius Radio” and “Have you ever listened to Sirius Business Radio by Wharton?”. A few demographic features used as control variables were also collected; these include Gender, Age and Household Income.

We requested that only people in United States answer the questions. Each person can only fill in the questionnaire once to avoid duplicates. Aside from these restrictions, we opened the survey to everyone in MTurk with a hope that the sample would be more randomly chosen.

The raw data is stored as Survey_results_final.csv on Canvas.

2.1 Data preparation

  1. We need to clean and select only the variables of interest. Select only the variables Age, Gender, Education Level, Household Income in 2013, Sirius Listener?, Wharton Listener? and Time used to finish the survey.

Change the variable names to be “age”, “gender”, “education”, “income”, “sirius”, “wharton”, “worktime”.

  1. Handle missing/wrongly filled values of the selected variables

As in real world data with user input, the data is incomplete, with missing values, and has incorrect responses. There is no general rule for dealing with these problems beyond “use common sense.” In whatever case, explain what the problems were and how you addressed them. Be sure to explain your rationale for your chosen methods of handling issues with the data. Do not use Excel for this, however tempting it might be.

Tip: Reflect on the reasons for which data could be wrong or missing. How would you address each case? For this homework, if you are trying to predict missing values with regression, you are definitely overthinking. Keep it simple.

There are invalid ages, such as “female”, eighteen. Some fields were also empty. In the education column, some people responded with “select one”, which is an invalid option.

  1. Brief summary

Write a brief report to summarize all the variables collected. Include both summary statistics (including sample size) and graphical displays such as histograms or bar charts where appropriate. Comment on what you have found from this sample. (For example - it’s very interesting to think about why would one work for a job that pays only 10cents/each survey? Who are those survey workers? The answer may be interesting even if it may not directly relate to our goal.)

## Warning: NAs introduced by coercion

For age, there are some unreasonable ages such as 4 and 223. Someone put their age as”female”. Another person wrote their age was “Eighteen (18)” which needs to be changed to just 18. Some are missing ages which could be because the person forgot to put it in or did not want to give their age away. These entries can be deleted. For gender, some people did not put in a gender. This could be because the options were Female or Male and they do not identify with either. For education, someone entered in “select one”, which is likely an error, and the person forgot to answer the question. For income, sirius, wharton, and worktime, people forgot to answer the question. Removing every entry with missing or clearly mis-entered data does not affect the size of the dataset much, so this is a convenient method of handling mis-entered data.”

  1. Brief summary

Write a brief report to summarize all the variables collected. Include both summary statistics (including sample size) and graphical displays such as histograms or bar charts where appropriate. Comment on what you have found from this sample. (For example - it’s very interesting to think about why would one work for a job that pays only 10cents/each survey? Who are those survey workers? The answer may be interesting even if it may not directly relate to our goal.)

The average age of participants was 30.4 years, and the median was 28.0 years old. Overall, the majority of the participants had ages between 18 and 35 years old. The participants were more likely to be male than female with males taking up 58% of the valid survey entries. Each of the income intervals (15k-30k, 30k-50k, 50k-75k and 75k-150k) were about 1/5th or 1/4th of the participants. Almost of the participants had some college, and very few had no diploma. One third had completed a bachelors program and around 2/5th had some college without a degree or an associates degree. Most participants listened to Sirius but not Wharton, with 4.996% of the Sirius listeners also listening to Wharton

2.2 Sample properties

The population from which the sample is drawn determines where the results of our analysis can be applied or generalized. We include some basic demographic information for the purpose of identifying sample bias, if any exists. Combine our data and the general population distribution in age, gender and income to try to characterize our sample on hand.

  1. Does this sample appear to be a random sample from the general population of the USA?

The sample does not appear to be a random sample from the general population of the USA. Younger people between the ages of 20-35 are over-represented while children and retirees are extremely under-represented. The sample also has a much larger percentage of people who are male with a percentage of 57% as compared with the US population, which is around 49.5% male. Income does seem to represent the general population of the USA, as more than 50% of population make over $50,000, which agrees with US Census. The sample also has a much larger proportion of people who attended some college and graduated with a bachelors than the general US population, and a much smaller proportion of people who only have a high school diploma.

  1. Does this sample appear to be a random sample from the MTURK population?

Note: You can not provide evidence by simply looking at our data here. For example, you need to find distribution of education in our age group in US to see if the two groups match in distribution. You may need to gather some background information about the MTURK population to have a slight sense if this particular sample seem to a random sample from there… Please do not spend too much time gathering evidence.

No, it does not. Comparing this sample collected in 2014 with the 2020 demographics of MTURK which is readily available online, the MTURK population is majority female (57%) whereas our sample was only 43% female. The MTURK population also has a much more even spread of ages with much better representation of people 40 years and older.

2.3 Final estimate

Give a final estimate of the Wharton audience size in January 2014. Assume that the sample is a random sample of the MTURK population, and that the proportion of Wharton listeners vs. Sirius listeners in the general population is the same as that in the MTURK population. Write a brief executive summary to summarize your findings and how you came to that conclusion.

In order to learn how many listeners the Wharton Business Radio show has, an MTURK survey was released several months after the show started. It was open for six days for anyone on the MTURK platform. Each survey paid 10 cents for its completion. The survey collected demographic information such as age, gender and income, and asked if the reader ever listened to SiriusXM or the Business Radio. It was requested that only people in the United States filled out the survey and each person could only fill it out once. After six days, 1,764 surveys were collected. All surveys with blank or missing entries were deleted. Some ages had non-numeric characters, were unreasonably low or high, or were not numbers. Those that could be fixed by removing non-numeric characters were corrected and the rest of the problematic surveys were deleted. Entries that claimed to listen to the Wharton show without listening to Sirius were also deleted. The number of Sirius listeners was 51.6 million. According to the survey, 4.996% of the Sirius listeners also listened to Wharton, so assuming the sample was representative of the Sirius listeners population, 2.58 million people listened to the Wharton show. Some limitations of the study include issues due to the survey respondents and the question asked. The survey respondents are not representative of the US population, with the men, college graduates, and younger people overrepresented. Many younger people may listen to the radio but may be less interested in a business talk show, which may decrease the calculated percentage of listeners who listen to the Wharton show. Men and college graduates may be more likely to listen to a business show on the radio which could increase the calculated percentage of listeners who listen to the Wharton show.

To be specific, you should include:

  1. Goal of the study
  2. Method used: data gathering, estimation methods
  3. Findings
  4. Limitations of the study.

2.4 New task

Now suppose you are asked to design a study to estimate the audience size of Wharton Business Radio Show as of today: You are given a budget of $1000. You need to present your findings in two months.

Write a proposal for this study which includes:

  1. Method proposed to estimate the audience size.

    We will conduct a random sample to estimate what percentage of the population uses Sirius XM, and also ask if they listen to the Wharton Business Radio Show in a very similar fashion to this case study. However, to ensure that survey is a simple random survey, we can follow procedures like calling randomly generated phone numbers ta random times of the day, or mail out letters to randomly selected addresses from randomly selected regions. Sending out our survey in more than one way (e.g. online, in person, across many different environments) will help to guarantee that our survey results are representative of the overall population. We can confirm that this process is correct by comparing distributions of features in the data like gender, age, income, etc.

    After data is collected, we can follow a similar procedure to the estimation calculations done above to find what percentage of responses listen to Sirius XM/what percentage listens to the Wharton Business Radio Show and extrapolate an estimate based on ratios.

  2. What data should be collected and where it should be sourced from. Please fill in the google form to list your platform where surveys will be launched and collected HERE

A good proposal will give an accurate estimation with the least amount of money used.

3 Case study 2: Women in Science

Are women underrepresented in science in general? How does gender relate to the type of educational degree pursued? Does the number of higher degrees increase over the years? In an attempt to answer these questions, we assembled a data set (WomenData_06_16.xlsx) from NSF about various degrees granted in the U.S. from 2006 to 2016. It contains the following variables: Field (Non-science-engineering (Non-S&E) and sciences (Computer sciences, Mathematics and statistics, etc.)), Degree (BS, MS, PhD), Sex (M, F), Number of degrees granted, and Year.

Our goal is to answer the above questions only through EDA (Exploratory Data Analyses) without formal testing. We have provided sample R-codes in the appendix to help you if needed.

3.1 Data preparation

  1. Understand and clean the data

Notice the data came in as an Excel file. We need to use the package readxl and the function read_excel() to read the data WomenData_06_16.xlsx into R.

a). Read the data into R.

b). Clean the names of each variables. (Change variable names to Field,Degree, Sex, Year and Number )

c). Set the variable natures properly.

d). Any missing values?

According to the output of the above functions, there are no missing values.

  1. Write a summary describing the data set provided here.

a). How many fields are there in this data?

There are 10 unique fields

b). What are the degree types?

  • The degree types are BS, MS, PhD

c). How many year’s statistics are being reported here?

  • There are years from 2006 to 2016, so 11 year’s statistics are being reported

3.2 BS degrees in 2015

Is there evidence that more males are in science-related fields vs Non-S&E? Provide summary statistics and a plot which shows the number of people by gender and by field. Write a brief summary to describe your findings.

Since we are looking at comparison between science related fields and non-science related fields, we will group the degrees into 2 categories: science related fields and non-science related fields, where non-science related are NON-S&E fields and every other fields are science related.

The graphs show that there are slightly more males in science fields than there are females in science fields, but the difference is not large. In absolute numbers, we conclude that there are slightly more men in science-related fields. However, there are both more men and women in not science-related fields than there are in science-related fields. However, the percentage of women in science-related fields is still much less than the percentage of men in science-related fields, as shown by the pie charts. Overall, there are more women getting degrees than men.

3.3 EDA bringing type of degree, field and gender in 2015

Describe the number of people by type of degree, field, and gender. Do you see any evidence of gender effects over different types of degrees? Again, provide graphs to summarize your findings.

## `summarise()` has grouped output by 'Field'. You can override using the
## `.groups` argument.

First, we will analyze the number of students in different degrees, what type of degree they are pursuing, and what their major is, split by gender.

From the first plot, there are more B.S. degrees given out for either gender verses MS’s or PhD’s. Non-science/engineering majors are the most popular.

For the second plot, after omitting non-science/enginerring majors from the barplot, it becomes easier to see trends in the science/engineering majors. For men, across all types of degrees, Engineering and social sciences are the most popular Non-S&E major. For women, among S&E, Social Sciences, Psychology, and Biology are the most popular field for each degree type. For both genders, Physical Sciences and Earth sciences are the least popular fields.

The graph shows that the gender split for each field is relatively constant throughout BS, MS and PhD degrees. Female dominated fields such as Psychology, Social Sciences, Biological Sciences, and Agricultural Sciences remain female dominated, whereas Computer Sciences, Earth Sciences, Engineering, Math/Stat and Physical Sciences remain male-dominated.

There are more Females in Non-S&E fields across all degree types in 2015.

Conclusion: There is evidence that there are more males than females in S&E fields by percentage and by sheer number.

3.4 EDA bring all variables

In this last portion of the EDA, we ask you to provide evidence numerically and graphically: Do the number of degrees change by gender, field, and time?

## `summarise()` has grouped output by 'SE', 'Sex', 'Year'. You can override using
## the `.groups` argument.

## `summarise()` has grouped output by 'SE', 'Sex', 'Year'. You can override using
## the `.groups` argument.

We see that there are about 50% of males and 50% of females in SE fields in BS, and percentage of males increases slightly while the percentage of females decreases slightly for MS and PhD. Looking at Non S&E graphs, the percentage of males and females overall seems to be equal throughout types of degrees, and there are more Females getting degrees in those fields. For both S&E and Non-S&E, the proportion of degrees given to males and females remains constant over time, showing that the percentage of women in sciences and engineering did not increase between 2006-2016.

We will now look at degree and gender distribution in relation to time.

## `summarise()` has grouped output by 'Sex', 'Degree'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'SE', 'Sex'. You can override using the
## `.groups` argument.

This graph tells up that the ratio on Females in Non-S&E fields has been relatively constant and started to go up slowly, where it hit highers in 2016, but there is not a great difference of proportion of females in 2006 vs 2016. Proportion of Females in S&E fields is not as constant, it went up int the year 2011-2012, and then started to go down, where it hit its lowest in 2016. Overall, the proportion of females decreased by a little since 2006 in S&E fields.

We will now analyze proportion of Females based on degree in relation to time.

Based on this graph, it is clear that there are fields where there are far more men than women (such as Engineering), and there are also fields where there are far more women than males (Psychology), looking at B.S degrees. Within BS degrees, there are some fields where over 50% of people are female but there are fields with over 50% of people who are males as well, so B.S degree contains as many women as men, proportionally.

Looking at M.S fields, disregarding Non-S&E, there seem to be more males than females. Even though there are fields wehre there are more females than males, there are more fields where there are more male than females.

Looking at PhD, there is also a larger proportion of males than females.

3.5 Women in Data Science

Finally, is there evidence showing that women are underrepresented in data science? Data science is an interdisciplinary field of computer science, math, and statistics. You may include year and/or degree.

## `summarise()` has grouped output by 'Sex', 'Degree'. You can override using the
## `.groups` argument.

<<<<<<< HEAD The graph above shows that the number of males getting degrees is larger than the number of females across years and fields. Moreover, and this feature is most evident in B.S and M.S degrees, where number of males and females getting a degree in Data Science increases, and the rate at which number of males increases is equal to the rate at which number of females increases, with the number of males always being larger than that of females. The second graph shows that over time, the proportion of females in data science is relatively constant. The proportion increases slightly for MS degrees and decreases slightly for PhD degrees. Therefore, there is evidence showing that women are underrepresented in data science.

To compare fields where there are more females than males vs more males than females, we will look at Engineering and Psychology fields.

The graphs show that there are significantly more females than males in Psychology, and there are significantly more males than females in Engineering. ## Final brief report

Summarize your findings focusing on answering the questions regarding if we see consistent patterns that more males pursue science-related fields. Any concerns with the data set? How could we improve on the study?

When looking at Science and engineering fields as a whole, we do see evidence that, overall, more males pursue science-related fields than women. Overall, a greater proportion of men were granted science-related fields than women, and the absolute number of men who were granted science-related degrees is greater than that of females. Looking at individual science-related fields, it is clear that Psychology, Social Sciences, Biological Sciences, and Agricultural Sciences are dominated by women, whereas Computer Sciences, Earth Sciences, Engineering, Math/Stat and Physical Sciences are male dominated. These trends hold true for all degree types and the gender split for each field holds relatively constant over 2006-2016, so more effort should be put into diversifying these fields. It is also clear that the more advanced degrees are more male dominated, whereas BS’s are granted to females and males almost equal numbers.

Data set: we need to have more representation of various racial and ethnic groups by increasing the sample size. One way to do this would be to scrape universities’ reports for more complete reports of information on majors, genders, race and ethnic groups. This will be relatively inexpensive because there are no human costs that need to be paid.

3.6 Appendix

To help out, we have included some R-codes here as references. You should make your own chunks filled with texts going through each items listed above. Make sure to hide the unnecessary outputs/code etc.

  1. Clean data

  2. A number of sample analyses

4 Case study 3: Major League Baseball

We would like to explore how payroll affects performance among Major League Baseball teams. The data is prepared in two formats record payroll, winning numbers/percentage by team from 1998 to 2014.

Here are the datasets:

-MLPayData_Total.csv: wide format -baseball.csv: long format

Feel free to use either dataset to address the problems.

4.1 EDA: Relationship between payroll changes and performance

Payroll may relate to performance among ML Baseball teams. One possible argument is that what affects this year’s performance is not this year’s payroll, but the amount that payroll increased from last year. Let us look into this through EDA.

Create increment in payroll

a). To describe the increment of payroll in each year there are several possible approaches. Take 2013 as an example:

Payroll may relate to performance among ML Baseball teams. One possible argument is that what affects this year’s performance is not this year’s payroll, but the amount that payroll increased from last year. Let us look into this through EDA.

4.2 Do log increases in payroll imply better performance?

We are interested in the relationship between payroll differences and performance. We are interested in the amount that payrolls increase from last year. One way we can do this is through percentage changes. However, measuring percentage changes is not symmetric going forward and backward (e.g. percentage difference between 2013 payroll and 2012 payroll is not the same as the difference between 2012 payroll and 2013 payroll). Log differences are symmetric going forward and backward and allow for a better way to measure change in payroll. Additionally, the sum of log differences over a series of numbers is the same as the log difference between the first and last observation, so it retains some of the usefulness of just differences.

b). Create a new variable diff_log=log(payroll_2013) - log(payroll_2012). Hint: use dplyr::lag() function.

c). Create a long data table including: team, year, diff_log, win_pct

4.3 Exploratory questions

a). Which five teams had highest increase in their payroll between years 2010 and 2014, inclusive?

Los Angeles Dodgers (2013), Miami Marlins (2012), Houston Astros (2014), Kansas City Royals (2012), Texas Rangers (2011)

b). Between 2010 and 2014, inclusive, which team(s) “improved” the most? That is, had the biggest percentage gain in wins?

The table above lists teams with the greatest improvement in decreasing order.

4.4 Do log increases in payroll imply better performance?

<<<<<<< HEAD - It is not necessarily true that increases in payroll imply better performance. For example, in 2011 the Arizona Diamondbacks actually experienced a decrease in payroll but a larger increase in the percent of games won.

Is there evidence to support the hypothesis that higher increases in payroll on the log scale lead to increased performance?

Pick up a few statistics, accompanied with some data visualization, to support your answer.

## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

This graph plots the log difference in pay against increases in performance over years. There is no clear pattern, so it is not true that larger pay increases correlate with larger performance increases. For example, in 2009, the performance decreased as payroll increased, but in 1996, performance increased as payroll decreased.

4.5 Comparison

Which set of factors are better explaining performance? Yearly payroll or yearly increase in payroll? What criterion is being used?

## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

The top graph visualizes multiple linear regressions of win percentage (measure of performance) against payroll. The last graph visualizes multiple linear regressions of win percentage against log payroll difference. It is apparent that yearly payroll is better at explaining performance, as there is a more pronounced and consistent positive trend across years. However, in recent years there is not a substantial positive correlation between yearly payroll and win percentage. The top graph shows that as years go by, the line starts to flatten out, meaning that performance is the same regardless of payroll. However, we do see a more consistent trend in the top graph than in the bottom one. Therefore, payroll is better.